Abstract: A particular case of Recurrent Neural Network (RNN)
was introduced at the beginning of the 2000s under the name of
Echo State Network (ESN). The ESN model overcomes the
limitations encountered when training RNNs while introducing
no significant disadvantages. However, the model presents some
well-identified drawbacks when its parameters are not well
initialized. The performance of an ESN is highly dependent on
its internal parameters and on the pattern of connectivity of
the hidden-hidden weights. Often, tuning the network parameters
is hard and can affect the accuracy of the models.

In this work, we investigate the performance of a specific
boosting technique (called L2-Boost) with ESNs as single
predictors. The L2-Boost technique has been shown to be an
effective tool to combine “weak” predictors in regression
problems. In this study, we use an ensemble of randomly
initialized ESNs (without controlling their parameters) as the
“weak” predictors of the boosting procedure. We evaluate our
approach on five well-known time-series benchmark problems.
Additionally, we compare this technique with a baseline
approach that consists of averaging the predictions of an
ensemble of ESNs.

Keywords: L2-Boosting, Echo State Network, Time-series modeling,
Reservoir Computing, Ensemble Methods


I. Introduction 

Boosting is a general procedure for improving the accuracy
of an ensemble of methods. It has been successfully used
in supervised learning problems since its appearance in the
1990s [1]-[3]. Several variations of the original Boosting idea
have been introduced over the years; one of the most
popular is called AdaBoost. At the beginning, Boosting
was used in problems where the output features were labels
or discrete responses (classification problems). An analogy
between AdaBoost and additive models was subsequently studied.
This connection was essential for extending Boosting
to problems where the output features are continuous
variables (regression problems). Bühlmann et al. developed
a variation of the Boosting technique called L2-Boost that is
constructed from an additive model and the functional gradient
descent method.

Since the early 2000s, a computational paradigm called
Reservoir Computing (RC) has gained prominence in the
Neural Computation community. In an RC model there are two
well-separated concepts: a dynamical system and a memory-less
function. The purpose of the dynamical system is to
encode the spatio-temporal information of the input patterns
into a spatial representation. At each time step, this
dynamical system is characterized by its state, which is
called the reservoir in the RC literature. This non-linear
transformation is most often realized by a Recurrent Neural
Network (RNN) with a large pool of interconnected neurons. A
distinctive principle of an RC model is that the parameters of
the dynamical system (the RNN weights) do not participate in
the training process. That is, once the reservoir parameters
are initialized, they remain fixed during the training
process. The other part of the model is a memory-less
supervised learning tool called the readout structure. This
part is designed to be robust and fast in the learning process.

RC models have been applied in neuroscience for processing
cognitive information in the neural system.
Furthermore, they have proven to be extremely effective tools
for time-series problems in the area of Machine Learning. For
instance, as far as we know, one of the most popular RC models,
the Echo State Network (ESN), has the best known learning
performance on the Mackey-Glass time-series prediction
problem. In this article, we concentrate on the ESN
model for solving time-series problems. The reservoir in the
ESN model is composed of an RNN with sigmoid neurons, and
the readout part of the model is a linear regression. The
weights of the connections between neurons in the reservoir
are collected in a matrix that we call the reservoir matrix.
The main global parameters of the ESN model are the input
scaling factor, the spectral radius of the reservoir matrix and
the pattern of connectivity among the reservoir units. Setting
these parameters often requires human expertise and several
empirical trials [10]. As a consequence, the setting procedure
can be expensive in computational time. For instance, the time
complexity of an algorithm that computes the spectral radius
of an N x N matrix grows polynomially with N.

The goal of this article is to investigate the performance
of an automatic procedure that combines single weak ESNs with
the L2-Boost technique. We develop an automatic technique
based on L2-Boost, which combines the predictions of several
randomly initialized ESNs in order to produce a highly accurate
tool. We use the term weak ESN for an ESN whose spectral radius
is neither computed nor checked. We call it weak because
such an ESN is not optimal.



The main advantages of the procedure presented in this article
are:

• Reduced computational effort. To save computation, the
approach combines weak ESNs. The procedure avoids tuning the
reservoir parameters, which can often have a high
computational cost. It uses only a uniform random
initialization of the weights. Note that there is no control
of the spectral radius, so some single weak ESNs can have
unstable dynamics.

• The procedure is automatic. The procedure does not require
external human expertise for setting the model parameters or
for evaluating the model performance.

• The technique has a new parameter used for overfitting
control. This parameter comes from the L2-Boost technique.

We present empirical results of the procedure introduced
in this paper on a range of benchmark problems. We
compare these performances with the accuracy obtained by a
single ESN. Furthermore, we carry out a comparison with the
accuracy of a baseline approach that computes the average
among single ESN models. Each single ESN is independently
initialized and adjusted during the learning process. There is
empirical evidence in the Machine Learning literature
showing that this baseline approach sometimes performs better
than other ensemble methods.

This work is a revised and expanded version of the
article [13].

This article is organized as follows. In Section II we start
with a specification of supervised learning problems with
temporal data. Next, we present an overview of the family of
additive models. Subsection II-C introduces the L2-Boost
technique. In Section III we present the Reservoir Computing
paradigm. In particular, we focus on the Echo State Network
model in Subsection III-A. Subsection III-B presents a
formalization of the procedure introduced in this article.
Section IV describes the empirical results. This section
starts with a description of the benchmark problems. Next,
we present the results obtained. Finally, the last section
provides conclusions and future work.


II. Background 

In this section, we start by specifying the context in which
the ESN model and the L2-Boost technique are applied. Next, we
present the additive models and we introduce a description of
the L2-Boost technique.


A. Problem Specification 

We begin by specifying a supervised learning problem. We are
given a data set $\mathcal{L} = \{(x^{(t)}, y^{(t)}) : t = 1,
\ldots, T\}$, where $x^{(t)}$ is an input vector and $y^{(t)}$
is either a class or a numerical response.
We denote by $N_x$ the dimension of the input vector $x$,
and by $N_y$ the dimension of the output vector $y$. We suppose
that the mapping between the input $x$ and the output $y$ is
given by a certain unknown function $F(\cdot)$. The goal
consists in learning a parametric function
$F(x^{(t)}, \mathcal{L})$ such that a certain error distance
between $F(x^{(t)}, \mathcal{L})$ and $y^{(t)}$ is minimized for
all $t$. The problem is called a regression problem when the
learning set has numerical output variables. Otherwise, it is
called a classification problem. In the case of regression
problems, it is recommended to use a quadratic distance. Even
though a quadratic distance can also be used in classification
problems, the Kullback-Leibler distance is recommended
in that domain.

An ESN model is mainly used for solving supervised learning
tasks in which the data set presents temporal dependencies,
although it can also be used for non-temporal supervised
learning problems. In this article we concentrate
only on temporal learning tasks with real output variables
($y^{(t)} \in \mathbb{R}^{N_y}$ for all $t$). In this work, the
models operate in standard discrete time. We want to forecast
some aspect of the output feature $y$ at time $t + k$, using
some aspect of the information available at the current time
$t$. That is, given the collection
$((x^{(t)}, y^{(t)}), (x^{(t-1)}, y^{(t-1)}), \ldots)$,
we would like to predict the value $y^{(t+k)}$ ($k > 0$) [15].
In this case, the goal consists in estimating a mapping
$F(\cdot)$ for predicting $y^{(t+k)}$ for some $k > 0$, such
that some distance between $F(\cdot)$ and $y$ is minimized.
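
As a minimal illustration of this forecasting setup, the
following Python sketch pairs each value available at time t
with the target k steps ahead. The function name
`make_forecast_pairs` and the toy series are hypothetical and
serve only the example; NumPy is assumed.

```python
import numpy as np

def make_forecast_pairs(series, k=1):
    """Pair each observation at time t with the target at time t + k."""
    series = np.asarray(series, dtype=float)
    if k <= 0 or k >= len(series):
        raise ValueError("k must satisfy 0 < k < len(series)")
    inputs = series[:-k]   # values available at time t
    targets = series[k:]   # values to be predicted, k steps later
    return inputs, targets

# toy series (hypothetical data): predict two steps ahead
x, y = make_forecast_pairs([0.1, 0.2, 0.3, 0.4, 0.5], k=2)
```

With k = 2 the last two observations have no target, so both
returned arrays are two elements shorter than the series.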


B. Additive Models 

The Boosting model has been analyzed under the form of an
additive model. Given a set of functions
$f^{(m)} : \mathbb{R}^{N_x} \to \mathbb{R}^{N_y}$,
$m = 1, \ldots, M$, characterized by a set of parameters
$\theta$ and expansion coefficients $\beta$,
an additive model has the following form

$$F(x) = \sum_{m=1}^{M} f^{(m)}(x). \qquad (1)$$

The functions $\{h(x; \theta)\}_{1}^{M}$ are called basis
functions. They are not fixed a priori and are selected
depending on the cost function used and the data set. An
important parameter of the model is the number of basis
functions ($M$) considered in expression (1). This parameter
controls the generalization error of the model. Since the main
goal in a learning task is to find a predictor with low
generalization error, the parameter $M$ plays an important role
in the accuracy of an additive model.

C. The L2-Boost Procedure

A relationship between the gradient descent technique and
stage-wise additive expansions was introduced at the beginning
of the 2000s. The introduction of the gradient descent
algorithm using a boosting approach was an essential
contribution to the field of ensemble learning methods. It
made it possible to use boosting in regression problems.
A boosting method for regression problems with quadratic error
distance was introduced under the name of L2-Boost. We
present the L2-Boost technique in Algorithm 1. Other boosting
variants have been presented for other kinds of distances.

We use the term epoch for one iteration of the training
algorithm through all the patterns in the training set. At
each epoch $m + 1$, the basis function $f^{(m+1)}$ is fitted to
the current residuals $e^{(i)} = y^{(i)} - F^{(m)}(x^{(i)})$ for
all $i$. Unlike other boosting techniques such as AdaBoost,
L2-Boost does not use any re-weighting. Another difference
between L2-Boost and other boosting methods is that L2-Boost
presents a tendency to overfit the data. The model with
contracting linear learners converges to the fully saturated
model. Each boosting epoch contributes additional overfitting,
thus the selection of the weak learners and of the parameter
$M$ is an essential task for this method. In practice, a few
boosting iterations are enough to achieve good performance
while avoiding the overfitting phenomenon.


Algorithm 1 The L2-Boost algorithm.

Require: $\mathcal{L}$, $M$, $h(x, \theta)$

Fit an initial model using a least squares fit;
for ($m = 1, \ldots, M$) do
    Compute the residuals for all patterns $i$:
        $e^{(i)} = y^{(i)} - F^{(m)}(x^{(i)})$;
    Fit the model parametrized as $f^{(m+1)}(x) = h(x, \theta)$
    to the current residuals $e$ using the least squares fit;
    Update: $F^{(m+1)}(\cdot) = F^{(m)}(\cdot) + f^{(m+1)}(\cdot)$;
end for
Return the function;
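
The stage-wise procedure of Algorithm 1 can be sketched in
Python as follows. This is a hedged illustration, not the
implementation used in the experiments: the weak learner (a
least-squares fit on a single input column), the constant
initial model and the names `fit_componentwise_ls` and
`l2_boost` are assumptions introduced for the example.

```python
import numpy as np

def fit_componentwise_ls(X, r):
    """Weak learner (assumption of this sketch): least-squares fit of the
    residuals r on the single column of X that lowers the squared error most."""
    best = None
    for j in range(X.shape[1]):
        xj = X[:, j]
        denom = float(xj @ xj)
        if denom == 0.0:
            continue
        coef = float(xj @ r) / denom
        sse = float(np.sum((r - coef * xj) ** 2))
        if best is None or sse < best[0]:
            best = (sse, j, coef)
    if best is None:  # every column is zero: nothing to fit
        return lambda Z: np.zeros(Z.shape[0])
    _, j, coef = best
    return lambda Z: coef * Z[:, j]

def l2_boost(X, y, M):
    """Stage-wise additive fit: each new learner is fitted to the residuals."""
    # initial least-squares model: a constant equal to the mean of y (assumption)
    learners = [lambda Z, c=float(y.mean()): np.full(Z.shape[0], c)]
    F = learners[0](X)
    for _ in range(M):
        residuals = y - F                        # e(i) = y(i) - F(x(i))
        f_new = fit_componentwise_ls(X, residuals)
        learners.append(f_new)
        F = F + f_new(X)                         # F <- F + f
    return lambda Z: sum(f(Z) for f in learners)
```

The returned callable evaluates the sum of all fitted learners,
so the number of epochs M directly sets the size of the additive
expansion.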


III. Modeling Time-series with Echo State 
Networks 

Recurrent Neural Networks (RNNs) are powerful tools
for solving time-series benchmarks. They are computational
methods that operate in time. In graph terminology, the
topology of an RNN contains at least one circuit.
The circuits of the network make it possible to store temporal
information, in order to learn and memorize the input
history. Each circuit creates an internal state, which makes
the recurrent network a discrete-time state-space model. At
each time step, the RNN receives an input pattern. Next, the
network updates its hidden state via a non-linear activation
function using the input pattern and the network state at the
previous time step. There is a general consensus in the
community that RNNs are a powerful tool for forecasting and
time-series prediction.

In spite of that, in practice the model presents some
drawbacks. The most important is that it is hard to train an
RNN using gradient descent methods [20]. Training methods
that use first-order differential information often have
stability problems and high numerical complexity. As a
consequence, much longer training times are necessary to adjust
the network weights. The main limitations of gradient descent
algorithms for training RNNs are analyzed in [20]. These
drawbacks are known as the vanishing and the exploding gradient
problems [20]. The vanishing gradient phenomenon occurs when
the norm of the gradient decreases arbitrarily fast to 0. The
exploding gradient phenomenon refers to the opposite case, when
the gradient norm increases greatly during the training
process. Recently, an effective algorithm for training RNNs
was introduced, which uses Hessian-free optimization for
setting the network parameters.

Reservoir Computing (RC) models appear as a good alternative
to RNNs. The two pioneering RC models are the Echo State
Network (ESN) and the Liquid State Machine (LSM). This
computational paradigm overcomes the main limitations related
to learning processes in RNNs while obtaining acceptable
performance in practical applications. In an RC model there
are at least two well-differentiated structures: a dynamical
system called the reservoir and another one called the readout.
The readout is a supervised learning tool for training with
non-temporal data, for example a feedforward neural network, a
linear regression or decision trees. A main characteristic of
an RC model is that the weights involved in circuits are kept
fixed during the learning process. Thus, the matrix with the
weights between reservoir units (the reservoir matrix) is
initialized in an arbitrary way and remains unchanged during
the learning process. The training algorithm is restricted to
updating the weights in the readout structure. Over the last
years several kinds of dynamical systems have been used for
generating the reservoir state. Models include
Backpropagation-Decorrelation Recurrent Learning [23], Leaky
Integrator Echo State Networks [24], Evolino, Intrinsic
Plasticity [25], Echo State Queueing Networks, Reservoir
Computing and Extreme Learning [27], and so on.

A. Formalization of the Echo State Network Model 

Among the RC methods, in this work we only study L2-Boost
with the ESN model.
An ESN reservoir is an RNN that maps an input space
$\mathbb{R}^{N_x}$ into a larger space $\mathbb{R}^{N_s}$ with
$N_x \ll N_s$. The connections between input and hidden
neurons are collected in an $N_s \times N_x$ weight matrix
$w^{in}$. The connections among the hidden neurons are
represented by an $N_s \times N_s$ weight matrix $w^{r}$. An
$N_y \times (N_x + N_s)$ weight matrix $w^{out}$ represents the
readout weights, which act on the concatenation of the input
and the state. At any time $t$, the information from the input
pattern and the past is represented in a state vector

$$s^{(t)} = \tanh(w^{in} x^{(t)} + w^{r} s^{(t-1)}). \qquad (2)$$

At any time $t$, the output prediction
$y^{(t)} \in \mathbb{R}^{N_y}$ is generated using the input
pattern and the reservoir state information. Most often it is
computed using a linear regression:

$$y^{(t)} = w^{out} [x^{(t)} | s^{(t)}], \qquad (3)$$

where $[\cdot | \cdot]$ is the vertical concatenation of the
vectors. For notational simplicity we omit the bias term;
a constant term is included in all the regressions.
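
A minimal Python sketch of the state update (2) and the linear
readout (3) is given next. The zero initial state, the sizes
and the function names are assumptions of the sketch. The
readout weights are left untrained because the goal is only to
show the data flow and the shapes involved.

```python
import numpy as np

def reservoir_states(inputs, w_in, w_r):
    """Iterate s(t) = tanh(w_in x(t) + w_r s(t-1)), starting from s = 0."""
    n_steps, n_s = inputs.shape[0], w_r.shape[0]
    states = np.zeros((n_steps, n_s))
    s = np.zeros(n_s)                      # zero initial state (assumption)
    for t in range(n_steps):
        s = np.tanh(w_in @ inputs[t] + w_r @ s)
        states[t] = s
    return states

def readout(inputs, states, w_out):
    """Linear readout applied to the concatenation [x(t) | s(t)]."""
    return np.hstack([inputs, states]) @ w_out.T

rng = np.random.default_rng(0)
n_x, n_s, n_y, n_steps = 1, 20, 1, 50      # illustrative sizes only
w_in = rng.uniform(-0.2, 0.2, size=(n_s, n_x))
w_r = rng.uniform(-0.8, 0.8, size=(n_s, n_s))
w_out = rng.normal(size=(n_y, n_x + n_s))  # untrained, shows the shape only
x = rng.uniform(0.0, 1.0, size=(n_steps, n_x))
s = reservoir_states(x, w_in, w_r)
y_hat = readout(x, s, w_out)
```

Only `w_out` would be adjusted by training; `w_in` and `w_r`
stay fixed after their random initialization.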

The stability of the reservoir dynamics in the ESN model has
been analyzed in the literature. Under certain algebraic
conditions the reservoir state depends (asymptotically) only on
the inputs and the network topology, and it becomes independent
of its initial conditions. These conditions are summarized in
the Echo State Property (ESP). In practice, the stability
of the ESN is almost always ensured when the spectral
radius of the reservoir matrix is less than 1 [28]. As a
consequence, the reservoir weights are usually scaled
in order to have a spectral radius less than 1. Scaling the
parameters requires computing the spectral radius of the
reservoir matrix, and computing the spectrum requires an
important computational effort. Some attempts to design
a procedure for initializing RC models have been introduced
in the literature.
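
How such a rescaling is commonly done can be sketched as
follows, assuming NumPy and a dense eigenvalue solver. The
target radius of 0.9 and the function names are illustrative
choices, not values taken from the experiments of this article,
where the weak ESNs deliberately skip this step.

```python
import numpy as np

def spectral_radius(w):
    """Largest absolute value among the eigenvalues of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(w))))

def rescale_reservoir(w, target_radius=0.9):
    """Scale w so that its spectral radius equals target_radius."""
    rho = spectral_radius(w)
    if rho == 0.0:
        raise ValueError("matrix has zero spectral radius and cannot be rescaled")
    return w * (target_radius / rho)

rng = np.random.default_rng(1)
w_r = rng.uniform(-0.8, 0.8, size=(50, 50))   # illustrative reservoir matrix
w_scaled = rescale_reservoir(w_r, target_radius=0.9)
```

The eigenvalue computation is the expensive part of this
routine, which is the cost that the procedure studied here
avoids.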

B. L2-Boost Using the ESN Model for Time-series Information
Processing

In this article, we investigate the performance of L2-Boost
in temporal learning tasks, and we consider as weak
learners a set of ESNs with random initialization.
Given an arbitrary parameter $M$, the procedure is as follows.
We initialize an ESN in a random way. The initialization
consists in selecting the size of the network as well as the
pattern of connectivity. We consider a reservoir with fixed
sparse connections. We do not control the spectral norm of
the reservoir weights. Guidance on the initialization procedure
can be found in the RC literature. We expand the input
information using the ESN reservoir given by expression (2),
thus we obtain $s^{(t)}$, $\forall t$. Next, we apply
Algorithm 1. Finally, we obtain the predictor. The approach is
summarized in Algorithm 2. In our experiments we use ridge
linear regression for computing the readout weights $w^{out}$.

Algorithm 2 The L2-Boost with the ESN model.

Initialize an ESN following the comments in Section III;
Compute the temporal expansion of $\mathcal{L}$ using (2);
Generate the set of expanded patterns;
Apply Algorithm 1;
Return the predictor;
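
One possible reading of Algorithm 2 is sketched below in
Python. Several choices are assumptions of the sketch rather
than details confirmed by the text: a new random reservoir is
drawn at every boosting epoch, the initial model is a constant,
and the reservoir size, washout, seed and toy sine series are
arbitrary. The function names are hypothetical.

```python
import numpy as np

def esn_features(x, n_s, rng):
    """Features [1 | x(t) | s(t)] from a random reservoir (no radius control)."""
    w_in = rng.uniform(-0.2, 0.2, size=(n_s, x.shape[1]))
    w_r = rng.uniform(-0.8, 0.8, size=(n_s, n_s))
    s = np.zeros(n_s)
    states = np.zeros((len(x), n_s))
    for t, x_t in enumerate(x):
        s = np.tanh(w_in @ x_t + w_r @ s)
        states[t] = s
    return np.hstack([np.ones((len(x), 1)), x, states])

def ridge_fit(features, targets, gamma):
    """Closed-form ridge regression weights."""
    gram = features.T @ features + gamma * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ targets)

def l2_boost_esn(x, y, M, n_s=10, gamma=1e-3, washout=10, seed=0):
    """Return the boosted training prediction and the training MSE per epoch."""
    rng = np.random.default_rng(seed)
    prediction = np.full(len(y), y[washout:].mean())   # constant initial model
    history = []
    for _ in range(M):
        feats = esn_features(x, n_s, rng)   # new weak ESN per epoch (assumption)
        residuals = y - prediction
        w_out = ridge_fit(feats[washout:], residuals[washout:], gamma)
        prediction = prediction + feats @ w_out
        history.append(float(np.mean((y[washout:] - prediction[washout:]) ** 2)))
    return prediction, history

# toy task (hypothetical data): one-step-ahead prediction of a sine wave
series = 0.5 + 0.4 * np.sin(0.2 * np.arange(300))
x, y = series[:-1, None], series[1:]
pred, history = l2_boost_esn(x, y, M=5)
```

Because every readout is fitted to the current residuals, the
training error in `history` cannot increase from one epoch to
the next, which mirrors the overfitting tendency discussed for
L2-Boost.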


In order to evaluate the performance of this procedure,
we compare the accuracy reached by the L2-Boost technique
with a simple baseline approach. The baseline approach
consists in combining $K$ single predictors (in our case the
learning predictors are ESNs). We consider randomly initialized
reservoirs, without control of the reservoir spectral norm. In
the baseline method, we train each of these single ESNs
independently. The final prediction is the average of the
single predictions.

For statistical comparisons between the methods we consider
$K = 30$. We remark again that we do not scale the
reservoir weights to obtain the ESP. Even though some
ESN models can present good accuracy, others can be
weak predictors. Additionally, we compare our performances
with the performance obtained when single ESNs with “good”
tuning of the reservoir parameters are used. For that, we use
the results presented in the RC literature.
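
The averaging step of the baseline can be written in a few
lines. The sketch below assumes that each single model has
already produced a prediction vector for the same time steps.
The function name and the numbers are hypothetical.

```python
import numpy as np

def average_ensemble(predictions):
    """Average K prediction vectors, one per independently trained model."""
    stacked = np.stack([np.asarray(p, dtype=float) for p in predictions])
    return stacked.mean(axis=0)

# three hypothetical single-model predictions for the same four time steps
single_predictions = [
    [0.10, 0.20, 0.30, 0.40],
    [0.12, 0.18, 0.33, 0.41],
    [0.08, 0.22, 0.27, 0.39],
]
ensemble_prediction = average_ensemble(single_predictions)
```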

IV. Empirical Results 

We begin this section by describing the benchmark problems.
Next, we specify the experimental setup. We conclude this
section with an analysis of our empirical results.


A. Description of the Benchmark Problems 

We use the following range of time-series benchmarks: 

• Fixed kth order NARMA. This data set presents a high
non-linearity and is widely used in the RC literature.
We generate the NARMA series following the description

$$b(t+1) = \alpha_1 b(t) + \alpha_2 b(t) \sum_{i=0}^{k-1} b(t-i)
+ \alpha_3 s(t-(k-1)) s(t) + \alpha_4,$$

where $s(t) \sim \mathrm{Unif}[0, 0.5]$ and the constant values
are shown in Table I. In order to evaluate the memorization
ability of the model, we consider two simulated series with
$k = 10$ and $k = 30$. The task consists in predicting the next
value of the series based on the history of $y(t)$ up to time
$t$. We used the first 200 samples as initial washout in
both the training and test procedures. The regularization
parameter used was 0.00001.

k     α1     α2      α3     α4
10    0.3    0.05    1.5    0.1
30    0.2    0.004   1.5    0.001

Table I: Parameters considered for the fixed kth order NARMA
series with k = 10 and k = 30.

• The Santa Fe Laser data set. It is an experimental
data set that contains the intensity pulsations of a real laser
recorded by a LeCroy oscilloscope. The data is a cross-cut
through periodic to chaotic intensity laser pulsations.
These pulsations more or less follow the theoretical
Lorenz model of a two-level system. In this problem,
the task consists in predicting the next measure $y(t+1)$,
given the preceding values up to $t$. The original data
consists of only 1000 measurements; we used 499 samples for
training and 500 samples for testing. We used a washout
of 10 samples. The regularization parameter $\gamma$ was 0.001.

• Henon Map data set. It is a prototypical invertible map
with chaotic solutions. The data is
generated by

$$y(t+1) = 1 - 1.4\, y(t)^2 + 0.3\, y(t-1) + z(t+1),$$

where the noise is $z(t) \sim N(0, 0.05)$. The data is
normalized in [0, 1]. The goal is to predict the next value
$y(t+1)$ with the past information up to $t$. We considered
a training set with 3995 samples and a test set with 795
samples. The regularization parameter $\gamma$ used was 0.001.
We use an initial washout of 100 samples. The
network topology has 3 input units, set with the last two
preceding $y(t)$ values and the noise at the current time.

• Freedman's non-linear time data set. The data is
generated by

$$y(t+1) = g(y(t)),$$

where

$$g(x) = \begin{cases} 2x, & \text{if } x < 0.5, \\
2 - 2x, & \text{if } x \geq 0.5. \end{cases}$$

We consider a very short data set. The length of the
training data was 30 and the test size was 19. The initial
value is $y(0) = 0.23719$. The initial washout consisted of
only 3 samples. The network topology has only
one input unit, several reservoir units and one output unit.
The regularization parameter $\gamma$ used was 0.001.
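
As an illustration of how the first benchmark of the list above
can be simulated, the following Python sketch generates the
fixed kth order NARMA series with the constants of Table I. The
zero initial history, the function name `narma` and the seed
are assumptions of the sketch; the article does not state how
the first k values were initialized.

```python
import numpy as np

# constants (alpha_1, ..., alpha_4) for k = 10 and k = 30, as in Table I
NARMA_CONSTANTS = {10: (0.3, 0.05, 1.5, 0.1), 30: (0.2, 0.004, 1.5, 0.001)}

def narma(n_samples, k, rng):
    """Fixed kth order NARMA series b driven by s(t) ~ Unif[0, 0.5]."""
    a1, a2, a3, a4 = NARMA_CONSTANTS[k]
    s = rng.uniform(0.0, 0.5, size=n_samples)
    b = np.zeros(n_samples)                 # zero initial history (assumption)
    for t in range(k - 1, n_samples - 1):
        window = b[t - k + 1 : t + 1]       # b(t), b(t-1), ..., b(t-(k-1))
        b[t + 1] = (a1 * b[t] + a2 * b[t] * window.sum()
                    + a3 * s[t - k + 1] * s[t] + a4)
    return s, b

s, b = narma(1600, k=30, rng=np.random.default_rng(0))
```

The pair `(s, b)` provides the input signal and the target
series; an initial washout is then discarded before training.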


B. Experimental Setup 


Data             Initial washout   γ         Train samples   Test samples
10th NARMA       200               0.00001   1400            2400
30th NARMA       200               0.00001   1600            2600
Santa Fe Laser   10                0.001     499             500
Henon Map        100               0.001     3995            795
Freedman's       3                 0.001     30              19

Table II: Parameter setting of the benchmark problems. In all
cases, we initialize the input weights ($w^{in}$) using a
uniform distribution in [-0.2, 0.2], and we initialize the
reservoir weights with a uniform distribution in [-0.8, 0.8].

We summarize the setting of the main parameters related
to the benchmark problems in Table II. The table presents
the initial washout period, the regularization parameter
($\gamma$) of the linear ridge regression, and the number of
train and test samples for each benchmark problem.

The benchmarks selected have been widely used in the
RC literature (see, e.g., [31]). In all cases, we use
the Normalized Mean Square Error (NMSE) as the measure of
model accuracy. The learning method used for computing
the output weight matrix was offline ridge regression.
This algorithm has a regularization parameter $\gamma$ that we
adjust for each benchmark problem. The data pre-processing step
consisted in normalizing the patterns in the interval [0, 1].
We investigated the algorithm performance for several reservoir
sizes. The range of the reservoir size values is specified for
each benchmark problem. The input layer is fully connected to
the reservoir layer with random weights in
[-0.2, 0.2]. The reservoir matrix is initialized using a
uniform distribution in [-0.8, 0.8].
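
The accuracy measure and the pre-processing step can be
sketched as follows. The article does not spell out its
normalization, so the definition of the NMSE used here (mean
squared error divided by the variance of the targets) is an
assumption, although a common one. The function names are
hypothetical.

```python
import numpy as np

def nmse(targets, predictions):
    """Mean squared error divided by the variance of the targets (assumed form)."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.mean((targets - predictions) ** 2) / np.var(targets))

def min_max_normalize(values):
    """Rescale values linearly into the interval [0, 1]."""
    values = np.asarray(values, dtype=float)
    low, high = values.min(), values.max()
    return (values - low) / (high - low)
```

Under this definition a predictor that always outputs the mean
of the targets has an NMSE of 1, so values well below 1
indicate that the model explains most of the variance.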


C. Result Analysis 

Table III shows results reported in the RC literature for these
benchmarks when a single ESN model was used as the
predictor. Table IV presents the training set accuracy reached
on the Henon Map data set. Columns 2, 3 and 4 show
the NMSE reached by L2-Boost with ESNs for $M$ epochs
($M = 3$, 4 and 5), respectively. Column 5 of Table IV shows
the accuracy of the baseline approach, that is, averaging the
predictions of 30 ESNs. The columns of the table are written
using scientific notation.

Table IV illustrates the accuracy of the models during
training. The NMSE corresponds to the training data of the
Henon Map data set. We present this table in order to illustrate
the overfitting tendency of L2-Boost with ESNs. The additive
model converges very fast to the solution; for this reason
columns 3 and 4 of Table IV are very similar. The model
with larger $M$ performs better on the training data, but it has
generalization problems. We found this characteristic in
all benchmarks. As a consequence, we can affirm that the
parameter $M$ has a relevant impact on the control of the
overfitting phenomenon. Similar remarks have been made for
the L2-Boost technique in non-temporal learning tasks.

Figure 2 illustrates the NMSE reached as a function of the
reservoir size for different $M$ values on the 30th NARMA data
set. This figure shows the training error, so we can see the
evolution of the NMSE versus the size of the reservoir. We
present a few values of the reservoir size, from 6 to 11.
Figures 1 and 3 show the NMSE on the test data versus the
reservoir size for the 10th and 30th order NARMA data sets,
respectively. These figures show 4 curves. The black one (with
points represented by dots) corresponds to the baseline method,
which combines several single ESNs. The other curves correspond
to L2-Boost-ESN with different numbers of epochs, $M = 6$, 8
and $M = 10$. We cannot affirm that the procedure of L2-Boost
with single weak ESNs performs better than optimal single
ESNs. The accuracy is, however, of the same order as the results
presented in the RC literature using a single well-initialized
ESN [31].

Figure 4 illustrates the evolution of the NMSE with the
reservoir size on the test data of the Santa Fe Laser
benchmark. The error was computed for L2-Boost with ESNs for
$M = 4$, 6 and $M = 8$ and for the baseline approach averaging
30 ESNs. Figure 5 shows the accuracy reached by the models on
the Freedman test data set. The graph shows the evolution
of L2-Boost with ESNs for $M = 4$, 5 and $M = 6$ and
of the baseline approach. In all graphs, we can see that when
the reservoir size increases, the test error of both L2-Boost
with ESNs and the baseline approach decreases. This
impact of the reservoir size on the accuracy
of the model also occurs with single ESNs.


Data             Accuracy           N_s   Ref.
10th NARMA       0.166 (NMSE)       50    [28]
                 0.0425 (NMSE)      200   [28]
30th NARMA       0.4542 (NRMSE)     100
Santa Fe Laser   0.0184 (NMSE)      50    [28]
                 0.00819 (NMSE)     200
Henon Map        0.00975 (NMSE)     50    [28]
                 0.00868 (NMSE)     200
Freedman's       0.0004302 (MSE)    40    [31]

Table III: Accuracy of the ESN model for the benchmark
problems. The second column shows the accuracy reached by the
single ESN, the third column gives the reservoir size and the
last column shows a bibliographic reference. In the case of the
Freedman's non-linear time data, the reservoir initialization
was done using the Scale Invariant Map method [31], and
the Mean Square Error (MSE) was the error measure. In the
case of 30th NARMA, the authors initialize the reservoir using
permutation matrices, and the error measure was the Normalized
Root Mean Square Error (NRMSE). In the other benchmark
problems, the authors control some reservoir parameters such
as the spectral radius and the reservoir matrix density.



























Reservoir   M = 3       M = 4       M = 5       Bas. ESN
size        (1.0e-12)   (1.0e-13)   (1.0e-13)   (1.0e-12)
6           0.297367    0.131036    0.131036    0.288720
7           0.191883    0.131036    0.131036    0.064412
8           0.142289    0.131036    0.131036    0.572728
9           0.160910    0.131036    0.131036    0.182942
10          0.203845    0.131036    0.131036    0.278237
11          0.272649    0.131036    0.131035    0.524017
12          0.488265    0.131036    0.131036    0.459002

Table IV: Train set performance on the Henon Map data set.
The first column indicates the number of neurons in the
reservoir. Columns 2, 3 and 4 show the NMSE obtained with
L2-Boost with M epochs (M = 3, 4 and 5), respectively. Column
5 shows the accuracy of the baseline approach, that is, the
average accuracy among 30 ESN predictions. The columns
are written using scientific notation.


10 order Narma test data 



Figure 1: Test set accuracy reached for the 10th order NARMA
data set. The vertical axis of the graph shows the NMSE
accuracy, and the horizontal axis presents some values of
the reservoir size. We compare the accuracy of L2-Boost
(M = 4, 5 and 6) with the baseline approach averaging 30
ESNs.


V. Conclusions and Future Work

At the beginning of the 2000s, an efficient technique to
train and design an RNN was developed under the name
of Echo State Network (ESN). This approach overcomes the
limitations of training RNNs with the gradient descent method.
The performance of an ESN is highly dependent on its
parameters and on the pattern of connectivity of the
hidden-hidden weights. Besides, setting up the network can be
computationally expensive, in particular computing the spectral
radius of the hidden-hidden weight matrix.

In this article, we investigated boosting ideas with ESNs, in
order to build a robust new learning tool. In particular, we
studied the use of L2-Boost with randomly initialized
ESNs. We merge a set of weak single ESNs. We call them weak
ESNs because they are randomly initialized, and we do not spend
extra computational effort on tuning the initial hidden-hidden
weights.


30 order Narma Train data 



Figure 2: The accuracy reached on the training set of the 30th
order NARMA data. The vertical axis of the graph shows the
NMSE accuracy, and the horizontal axis presents some values
of the reservoir size. We compare the accuracy of L2-Boost
(M = 6, 8 and 10) with the baseline approach averaging 30
ESNs.

Figure 3: The accuracy reached for the 30th order NARMA
data set. The vertical axis of the graph shows the NMSE
accuracy, and the horizontal axis presents some values of
the reservoir size. We compare the accuracy of L2-Boost
(M = 6, 8 and 10) with the baseline approach averaging 30
ESNs.


In spite of numerous tests, we cannot
affirm that L2-Boost with ESNs performs better than a single
well-initialized ESN (according to the results presented in the
RC literature). However, the main advantage of L2-Boost with
weak ESNs is that the procedure is automatic and does not
require the computational effort of computing the spectrum of
the hidden-hidden weight matrix. Additionally, the procedure
has a control parameter for the overfitting phenomenon.

In future work we will test the model using another
technique for decreasing the generalization error, as well as
on a larger number of benchmark problems. Additionally, we can
test the approach using another supervised learning tool for
the readout structure.


Laser test data

Figure 4: Test set accuracy of the Santa Fe Laser data. The
vertical axis of the graph shows the NMSE accuracy, and the
horizontal axis presents some values of the reservoir size. We
compare the accuracy of L2-Boost (M = 4, 6 and 8) with a
baseline approach averaging 30 ESNs.


Freedman test data

Figure 5: Test set performance of the Freedman data set.
Size of the reservoir versus the NMSE accuracy. We compare
the accuracy of L2-Boost (M = 4, 5 and 6) with a baseline
approach averaging 30 ESNs.


References